60 research outputs found

Collecting a corpus of Dutch SMS

Author: De Clercq Orphée
Oostdijk Nelleke
Treurniet Maaske
van den Heuvel Henk
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2012

In this paper we present the first freely available corpus of Dutch text messages containing data originating from the Netherlands and Flanders. This corpus has been collected in the framework of the SoNaR project and constitutes a viable part of this 500-million-word corpus. About 53,000 text messages were collected on a large scale, based on voluntary donations. These messages will be distributed as such. In this paper we focus on the data collection processes involved and, after studying the effect of media coverage, we show that free publicity in newspapers and on social media networks in particular results in more contributions. All SMS are provided with metadata information. Looking at the composition of the corpus, it becomes visible that a small number of people have contributed a large amount of data; in total, 272 people contributed to the corpus over three months. The number of women contributing to the corpus is larger than the number of men, but male contributors submitted larger amounts of data. This corpus will be of paramount importance for sociolinguistic research and normalisation studies.

CiteSeerX

Ghent University Academic Bibliography

Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus

Author: De Clercq Orphée
Heuvel Henk van den
Jong Franciska de
Oostdijk Nelleke
Reynaert Martin
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2010

In the Low Countries, a major reference corpus for written Dutch is being built. We discuss the interplay between data acquisition and data processing during the creation of the SoNaR Corpus. Based on developments in traditional corpus compiling and new web harvesting approaches, SoNaR is designed to contain 500 million words, balanced over 36 text types including both traditional and new media texts. Besides its balanced design, every text sample included in SoNaR will have its IPR issues settled to the largest extent possible. This data collection task presents many challenges because every decision taken on the level of text acquisition has ramifications for the level of processing and the general usability of the corpus. As far as the traditional text types are concerned, each text brings its own processing requirements and issues. For new media texts (SMS, chat) the problem is even more complex: issues such as anonymity, recognizability and citation right all present problems that have to be tackled. The solutions actually lead to the creation of two corpora: a gigaword SoNaR, IPR-cleared for research purposes, and the smaller (of commissioned size), more privacy-compliant SoNaR, IPR-cleared for commercial purposes as well.

CiteSeerX

Ghent University Academic Bibliography

Radboud Repository

University of Twente Research Information

Tilburg University Repository

Introducing the CLARIN-NL Data Curation Service

Author: Henk Van Den Heuvel
Nelleke Oostdijk
Publication date: 11/04/2020

In this paper we introduce the CLARIN-NL Data Curation Service. We highlight its tasks and its mediating position between researchers and the CLARIN Data Centres. We outline a scenario for successful data curation and stress the need to take notice of the factors that determine the desirability and feasibility of data curation. Finally, we present and discuss an exemplary case that illustrates the relevant issues involved in setting up a data curation plan.

CiteSeerX

Introducing the CLARIN-NL Data Curation Service

Author: Henk Van Den Heuvel
Nelleke Oostdijk
Publication date: 11/04/2020

CLARIN-NL is a project directed at the development of a sustainable research infrastructure for the humanities and social sciences. The resources (data and tools) which researchers in the various disciplines employ constitute an integral part of such an infrastructure. Whether the infrastructure will be successful in supporting the needs of the research communities it intends to cater for depends on a number of factors. One factor is that resources that are or could be relevant to the wider research community are made visible through this infrastructure and, to the extent possible, accessible and usable.

Over the past decades numerous datasets have been collected and annotated by researchers for use in their own research. Often such datasets sank into oblivion once the research results had been published, while occasionally data were actually lost. Over the years it has become apparent that unless appropriate action is undertaken to actively curate existing resources, many are at risk of being lost, as individual researchers or research groups often lack the expertise and the means to take the necessary measures to ensure their future availability.

By resource curation we mean the planning, allocation of financial and other means, and application of preservation methods and technologies to ensure that digital information of enduring value remains accessible and usable. It encompasses material that begins its life in digital form as well as material that is converted from traditional analog to digital formats. Digital information must be stored long-term and error-free, with means for retrieval and interpretation, for the entire time span the information is required for; in other words, it must be possible to decode and transform the retrieved files (of texts, charts, images or sound) into usable representations (cf. Hedstrom 1997).

Resource curation is important:
- from an economic point of view: curation is needed to prevent loss of resources that were created at substantial effort and expense. Loss may occur as a result of media deterioration or digital obsolescence. Costs are incurred when resources are lost and must be rebuilt. In some cases, resources are unique and cannot be replaced if destroyed or lost.
- in terms of scientific interest: curation grants access to the resources to a wider user community, allowing researchers to share access to datasets and permitting replicability in research.
- for reasons of cultural heritage.

From the start of the project (2009), funding has been available in CLARIN-NL for projects directed at resource curation. Although a number of curation projects were undertaken, the calls for proposals have been less successful in reaching resource producers and owners who were not already aware of and/or participating in CLARIN-NL. In October 2010 the CLARIN-NL Executive Board therefore initiated a pilot project to investigate the need and possibility for establishing a Data Curation Service (DCS) task force that would salvage valuable corpora and datasets that are at risk of being lost. The idea was that a dedicated team of specialists should be made responsible for curating data residing with humanities researchers, especially those who are reluctant or incapable of undertaking th

CiteSeerX

Event Causality Identification with Causal News Corpus -- Shared Task 3, CASE 2022

Author: Caselli Tommaso
Hettiarachchi Hansi
Hürriyetoğlu Ali
Liza Farhana Ferdousi
Oostdijk Nelleke
Tan Fiona Anting
Uca Onur
Publication date: 01/01/2022

The Event Causality Identification Shared Task of CASE 2022 involved two subtasks working on the Causal News Corpus. Subtask 1 required participants to predict whether a sentence contains a causal relation. This is a supervised binary classification task. Subtask 2 required participants to identify the Cause, Effect and Signal spans per causal sentence. This could be seen as a supervised sequence labeling task. For both subtasks, participants uploaded their predictions for a held-out test set, and ranking was done based on binary F1 and macro F1 scores for Subtasks 1 and 2, respectively. This paper summarizes the work of the 17 teams that submitted their results to our competition and the 12 system description papers that were received. The best F1 scores achieved for Subtasks 1 and 2 were 86.19% and 54.15%, respectively. All the top-performing approaches involved pre-trained language models fine-tuned to the targeted task. We further discuss these approaches and analyze errors across participants' systems in this paper. Comment: Accepted to the 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2022)

arXiv.org e-Print Archive

Birmingham City University Open Access Repository

BCU Open Access

University of East Anglia digital repository